Main goals of the session
The main aim of this session is to understand the importance of “substitution models” in molecular evolution. In particular, exercises will focus on the important distinction between the number of differences observed when comparing nucleotide sequences and the actual number of substitutions that have occurred in their divergence. First, you will work with the Jukes and Cantor model and the famous distance correction based on this model. You will also simulate nucleotide sequences under a more complex substitution model and use “Model Test” to select the best-fitting model underlying the data.
In this lesson we will assume that
The length of the sequences studied is finite (i.e. multiple mutations can occur at the same position, some of them even back to an ancestral nucleotide).
The sequences studied have evolved independently since their divergence from a common ancestor (i.e. the substitutions that have occurred in each descendant lineage are independent).
the mutation rate is constant over time and the same for all lineages and positions.
Throughout the document you will see different icons whose meaning is:
: Additional or useful information
: Practical exercise
: Hint to solve an exercise or to do a task
: Slot to answer a question
: Problem or task to be solved
You can use either the R console in the terminal or
RStudio for this exercise. If you don’t have R
installed, you can download the appropriate package for your system here. To install
RStudio, go to this page and
follow the instructions.
Before starting the exercises, you will need to install the
phangorn library. This package includes a function for
comparing different nucleotide substitution models.
Open the R console in the terminal (or in
RStudio) and type:
#install.packages("phangorn") # Do not run if you have installed phangorn in practice 4.
The most popular and flexible software for performing model selection is ModelTest (Darriba et al. 2019). The latest version of this program can be downloaded here. However, for practical reasons, we decided not to use this software in these exercises, as it requires compilation and depends on some third-party dependencies.
Given that DNA sequences can only present four different states (A, T, C, G), the number of observed and actual substitutions can be different when comparing sequences of finite length (the more divergent the sequences the greater this difference will be).
Let’s see an example of this problem. Read carefully the
explanation of the different steps we go through for the completion of
the simulations and reproduce the process in the R terminal
(or RSsudio):
subst.rate <- 1e-6
G <- 1e6
L <- 100
n.subst.lineage.1 <- rpois(n=1,lambda=subst.rate*L*G)
n.subst.lineage.1
## [1] 114
n.subst.lineage.2 <- rpois(n=1,lambda=subst.rate*L*G)
n.subst.lineage.2
## [1] 103
total.subst <- n.subst.lineage.1 + n.subst.lineage.2
total.subst
## [1] 217
Figure 1. Expected number of substitutions in two lineages after their split from a common ancestor. The actual number of substitutions in each lineage will be a random sampling from a Poisson distribution with lambda = r x T
1. Which would be the actual evolutionary divergence (K= number of substitutions per site) between the two descendant sequences in your simulated replicate?
The evolutionary divergence would be 2.17
2. If you could compare the sequences after time G, would you see all these substitutions? Why?
No, the amount of changes can't be observed with only the resulting sequence.
3. Should the estimate in Question 1 be corrected, and if so, which substitution model would you use?
Yes, the estimate should be corrected to account for multiple hits. I would use Jukes and Cantor as it is a simple method
In this section you will simulate the evolution of two DNA sequences under the same scenario as in Example 1 and the Jukes and Cantor model. In this case, you start with an ancestral sequence (a simulated sequence; 100 bp), estimate the number of substitutions in the two lineages, randomly choose a position in the sequence to introduce the substitution, and replace the existing nucleotide with one of the other three (also randomly).
Enter the R console and type the following
commands:
Tnt <- c("A","C","G","T")
L <- 100
seq <- sample(x=Tnt, size=L, replace=TRUE, prob=c(0.25,0.25,0.25,0.25))
seq
## [1] "T" "T" "C" "C" "G" "G" "A" "A" "C" "A" "A" "G" "C" "T" "T" "C" "C" "G"
## [19] "C" "A" "C" "A" "C" "T" "T" "C" "T" "T" "A" "A" "A" "T" "C" "C" "C" "T"
## [37] "T" "A" "G" "G" "A" "T" "G" "T" "T" "A" "T" "C" "T" "A" "A" "G" "T" "A"
## [55] "C" "G" "A" "G" "G" "A" "C" "C" "A" "T" "C" "C" "A" "A" "C" "T" "C" "T"
## [73] "C" "T" "T" "A" "T" "A" "T" "C" "G" "A" "A" "A" "A" "A" "G" "C" "C" "T"
## [91] "T" "A" "T" "A" "A" "T" "A" "C" "C" "A"
Q:Q <- matrix(1/3, ncol=4, nrow=4, dimnames=list(c("A","C","G","T"), c("A","C","G","T")))
for(x in 1:4) {Q[x,x] <- 0; Q[x,x] <- -sum(Q[x,])}
Q
## A C G T
## A -1.0000000 0.3333333 0.3333333 0.3333333
## C 0.3333333 -1.0000000 0.3333333 0.3333333
## G 0.3333333 0.3333333 -1.0000000 0.3333333
## T 0.3333333 0.3333333 0.3333333 -1.0000000
Let’s write an R
function defined as:
substitutions.on.sequence () to randomly choose the
specific nucleotide change (e.g. A->G or A->C, etc..) for each
substitution and the site among the L positions in the ancestral
sequence seq where the change occurred (=to generate the
two descendant sequences). The parameters of this function will be:
the ancestral sequence, the JC rate matrix and the
four possible states (nucleotides):
substitutions.on.sequence <- function(seq,substitutions,Q,Tnt) {
seqis a vector of symbols with length L in which each position can only take four possible states (e.g.., A, T, A, G, C …etc.),substitutionsis the total number of expected substitutions,Qis the instantaneous substitution rate matrix under a particular model (here the JC) andTntis a vector of C values containing the different symbols accepted in the sequence (the four nucleotides).
The script continues as follows (NOTE that you defined
seqin the example 2):
seq1 <- seq
len <- length(seq)
now, we select the sites where the substitutions will fall on (here we use a uniform probability, that is, all sites are equally likely to mutate):
position.substitution <- floor(runif(n=substitutions ,min=1, max=len))
then, we check the nucleotide that is in the position that will
mutate(nti = the position in the matrix that corresponds to
the nucleotide that will change, e.g. A = 1) and therefore, which
nucleotides can replace it (’rest` = the positions in the matrix that
correspond to the three possible changes, e.g. C=2, T=3 and G=4):
number.symbols <- c(1:length(Tnt))
for(ps in position.substitution) {
nti<-which(Tnt==seq1[ps])
rest.symbols<-number.symbols[-nti]
In the transition matrix Q, the sum of instantaneous
rates for each of the three possible changes (SQ) is equal
to 1 (remember that in Q, we do not consider the diagonal
for this sum). Consider cQ as the cumulative probability of
each of these changes (this is a trick to sample from an uniform
distribution with probabilities 0.3333; see the loop below):
SQ<- 1
cQ<- 0
ntj<- 1
based on this probabilities, which is the new nucleotide (the
ntj position in the matrix) after the substitution?:
ranp <- runif(1)
while(ntj > 0) {
calculate the (cumulative) probability of a substitution from
nucleotide nti to ntj:
cQ<-cQ+Q[nti,rest.symbols[ntj]]/SQ
if(ranp < cQ) {
seq1[ps] <- Tnt[rest.symbols[ntj]]
ntj <- -1
}
ntj <- ntj+1
}
} seq1
}
Here you can find the complete R code for the function:
substitutions.on.sequence<-function(seq,substitutions,Q,Tnt) {
seq1 <- seq
len <- length(seq)
position.substitution <- floor(runif(n=substitutions,min=1,max=len))
number.symbols <- c(1:length(Tnt))
for(ps in position.substitution) {
nti <- which(Tnt==seq1[ps])
rest.symbols <- number.symbols[-nti]
SQ <- sum(Q[nti,-nti])
cQ <- 0
ntj <- 1
ranp <- runif(1)
while(ntj > 0) {
cQ <- cQ+Q[nti,rest.symbols[ntj]]/SQ
if(ranp < cQ) {
seq1[ps] <- Tnt[rest.symbols[ntj]]
ntj <- -1
}
ntj <- ntj+1
}
}
seq1
}
seq.ancestral <- seq
seq.lineage.1 <- substitutions.on.sequence(seq.ancestral,n.subst.lineage.1,Q,Tnt)
seq.lineage.2 <- substitutions.on.sequence(seq.ancestral,n.subst.lineage.2,Q,Tnt)
seq.ancestral
## [1] "T" "T" "C" "C" "G" "G" "A" "A" "C" "A" "A" "G" "C" "T" "T" "C" "C" "G"
## [19] "C" "A" "C" "A" "C" "T" "T" "C" "T" "T" "A" "A" "A" "T" "C" "C" "C" "T"
## [37] "T" "A" "G" "G" "A" "T" "G" "T" "T" "A" "T" "C" "T" "A" "A" "G" "T" "A"
## [55] "C" "G" "A" "G" "G" "A" "C" "C" "A" "T" "C" "C" "A" "A" "C" "T" "C" "T"
## [73] "C" "T" "T" "A" "T" "A" "T" "C" "G" "A" "A" "A" "A" "A" "G" "C" "C" "T"
## [91] "T" "A" "T" "A" "A" "T" "A" "C" "C" "A"
seq.lineage.1
## [1] "C" "C" "C" "A" "G" "C" "T" "C" "A" "A" "T" "T" "A" "G" "T" "A" "T" "G"
## [19] "G" "T" "T" "T" "C" "A" "G" "C" "C" "T" "T" "T" "A" "C" "G" "C" "A" "G"
## [37] "A" "A" "C" "T" "C" "T" "G" "G" "C" "A" "G" "C" "A" "A" "A" "A" "A" "A"
## [55] "G" "T" "A" "T" "T" "A" "G" "T" "A" "G" "T" "C" "C" "G" "T" "T" "G" "A"
## [73] "C" "T" "G" "C" "A" "A" "T" "C" "G" "A" "T" "A" "A" "G" "A" "G" "G" "G"
## [91] "A" "A" "A" "A" "C" "C" "C" "A" "T" "A"
seq.lineage.2
## [1] "C" "G" "T" "C" "T" "G" "A" "T" "C" "G" "A" "G" "C" "G" "C" "T" "C" "G"
## [19] "T" "G" "C" "T" "C" "T" "G" "G" "A" "C" "A" "A" "T" "C" "C" "C" "C" "A"
## [37] "T" "T" "C" "G" "A" "T" "A" "T" "C" "A" "T" "C" "C" "A" "A" "G" "A" "T"
## [55] "A" "A" "A" "A" "G" "T" "G" "C" "A" "T" "A" "G" "T" "T" "T" "A" "G" "C"
## [73] "T" "C" "A" "A" "C" "G" "T" "C" "G" "C" "T" "A" "A" "T" "T" "A" "T" "T"
## [91] "C" "A" "A" "A" "C" "C" "C" "A" "T" "A"
Diff <- matrix(0, nrow=3, ncol=2, dimnames=list(c("1 vs.2","anc vs.1","anc vs.2"),c("OBS","ACTUAL")))
Diff[1,1] <- sum(seq.lineage.1 != seq.lineage.2)
Diff[2,1] <- sum(seq.ancestral != seq.lineage.1)
Diff[3,1] <- sum(seq.ancestral != seq.lineage.2)
Diff[1,2] <- n.subst.lineage.1 + n.subst.lineage.2
Diff[2,2] <- n.subst.lineage.1
Diff[3,2] <- n.subst.lineage.2
Diff
## OBS ACTUAL
## 1 vs.2 64 217
## anc vs.1 65 114
## anc vs.2 57 103
In the Jukes and Cantor model, the relationship between observed (p) and actual (K) number of nucleotide substitutions per site is given by the following equation:
Using this formula, we can calculate the theoretical curve of the relationship between these two quantities:
subst.rate <- 1e-6
par(mfrow=c(1,1))
G.list <- seq(from=1e3, to=1e6, by=(1e6-1e3)/29)
y1 <- seq(from=0,to=2*subst.rate*G.list[30], by=2*subst.rate*G.list[30]/29)
plot(x=2*G.list, y=y1, ylim=c(0,2), type="l", col="red", xlab="Time (years)", ylab="Divergence")
y2<-c(3/4-3/4*exp(-4/3*subst.rate*2*G.list)) # isolate p from the formula
lines(x=2*G.list, y=y2, ylim=c(0,2), type="l", col="black")
The number of simulated substitutions (red circles) and differences (black circles) per site in a sequence of 100 bp (using the R function generated above):
L<-1e3
seq.ancestral<-sample(x=Tnt,size=L,replace=TRUE,prob=c(0.25,0.25,0.25,0.25))
Q.JC <-matrix(1/3, ncol=4, nrow=4,dimnames=list(c("A","C","G","T"), c("A","C","G","T")))
for(x in 1:4) {Q.JC[x,x] <-0; Q.JC[x,x]<- -sum(Q.JC[x,])}
G.list vector, see above):p.obs <- array(0,30)
K.act <- array(0,30)
i<-1
for(g in G.list) {
n.subst.lineage.1 <- rpois(n=1,lambda=subst.rate*L*g)
n.subst.lineage.2 <- rpois(n=1,lambda=subst.rate*L*g)
seq.lineage.1 <- substitutions.on.sequence(seq.ancestral,n.subst.lineage.1,Q.JC,Tnt)
seq.lineage.2 <- substitutions.on.sequence(seq.ancestral,n.subst.lineage.2,Q.JC,Tnt)
p.obs[i] <- sum(seq.lineage.1 != seq.lineage.2)/length(seq.ancestral)
K.act[i] <- (n.subst.lineage.1+n.subst.lineage.2)/length(seq.ancestral)
i<-i+1
}
y1 <- seq(from=0,to=2*subst.rate*G.list[30], by=2*subst.rate*G.list[30]/29)
plot(x=2*G.list, y=y1, ylim=c(0,2), type="l", col="red", xlab="Time (years)", ylab="Divergence")
y2<-c(3/4-3/4*exp(-4/3*subst.rate*2*G.list))
lines(x=2*G.list, y=y2, ylim=c(0,2), type="l", col="black")
lines(x=2*G.list,y=p.obs ,type="p",col="black")
lines(x=2*G.list,y=K.act, type="p",col="red")
4. Why don’t the simulated values fit perfectly with those expected from the theory (theoretical curves)?
Inherent variability of genetic sequences
5. Based on the plot, which is the main consequence of not correcting divergences for multiple hits when calculating evolutionary rates?
Due to the variance not correcting for hits will cause a cascading effect where hits will be more divergent every time
6. Looking at the black curve, do you think there is any limit to the Jukes and Cantor correction? Could this correction be applied in all cases, regardless of divergence times?
Corrections in Jukes and Cantor are limited because when divergent time increses, so do multiple hits and the model may no t account for all the complexities of the substitution process. This is also why it can't be applied to all cases regardless of divergent times
Let’s now work with a more complex model, the HKY85 model Hasegawa, Kishino and Yano 1985, which allows unequal base frequencies and distinguishes between transition and transversion rates.
p.A <- 0.55
p.C <- 0.06
p.G <- 0.36
p.T <- 1-(p.A+p.C+p.G)
p.S <- 0.75
p.V <- 1-p.S
Q.HKY <- c(0,p.C*p.V,p.G*p.S,p.T*p.V,p.A*p.V,0,p.G*p.V,p.T*p.S,p.A*p.S,p.C*p.V,0,p.T*p.V,p.A*p.V,p.C*p.S,p.G*p.V,0)
Q.HKY <- matrix(Q.HKY,ncol=4,nrow=4,byrow=TRUE,dimnames=list(c("A","C","G","T"), c("A","C","G","T")))
for(x in 1:4){Q.HKY[x,x] <- 0; Q.HKY[x,x] <- -sum(Q.HKY[x,])}
Q.HKY
## A C G T
## A -0.2925 0.015 0.270 0.0075
## C 0.1375 -0.250 0.090 0.0225
## G 0.4125 0.015 -0.435 0.0075
## T 0.1375 0.045 0.090 -0.2725
Simulate an ancestral DNA sequence
(ex4.ancestral.seq) and two descendant sequences
(ex4.seq.lineage.1 and ex4.seq.lineage.2) as
in the example 2 (step 1) but using the nucleotide frequencies
in the example 4. Then, use function
substitutions.on.sequence with the Q.HKY
matrix to generate two descendant sequences after G
generations.
Generate a file (simulated.fasta) with the three
sequences in FASTA format (the ancestral and the two
descendant sequences). To do that, you can use the function
writeLines() with sep="", to print the clean
sequence in the screen. Copy and paste the sequences in the file in a
correct fasta format.
SeqGen<-function(f_A, f_C, f_G, f_T, L){
Tnt <- c("A","C","G","T")
seq <- sample(x=Tnt, size=L, replace=TRUE, prob=c(f_A,f_C,f_G,f_T))
seq
}
substitutions.on.sequence <- function(seq, Q, G) {
for (gen in 1:G) {
pos <- sample(1:length(seq), 1)
current_base <- seq[pos]
new_base <- sample(c("A", "C", "G", "T"), 1, prob = Q[rownames(Q) == current_base, ])
seq[pos] <- new_base
}
return(seq)
}
ex4.ancestral.seq<-SeqGen(p.A, p.C, p.G, p.T, 100)
Q<-matrix(c(0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0), nrow=4, byrow=TRUE)
colnames(Q)<-rownames(Q)<-c("A", "C", "G", "T")
generations<-1000
ex4.seq.ancestral <- seq
ex4.seq.lineage.1 <- substitutions.on.sequence(ex4.ancestral.seq, Q, generations)
ex4.seq.lineage.2 <- substitutions.on.sequence(ex4.ancestral.seq, Q, generations)
f<-file("simulated.fasta")
writeLines(c(">Ancestral\n", ex4.seq.ancestral, "\n>S1\n", ex4.seq.lineage.1, "\n>S2\n", ex4.seq.lineage.2), sep="", f)
close(f)
modelTest() from the
package phangorn to find the best-fit substitution model
underlying the simulated alignment. This program compares the
probability of observing the data under different substitution models
and estimates the matrix and the base frequencies for the best
model.library(phangorn)
## Loading required package: ape
data<-read.phyDat(file="simulated.fasta", format="fasta")
test<-modelTest(data)
## Model df logLik AIC BIC
## JC 3 -412.7676 831.5351 839.3507
## JC+I 4 -412.2865 832.573 842.9937
## JC+G(4) 4 -412.4947 832.9894 843.4101
## JC+G(4)+I 5 -412.312 834.624 847.6499
## F81 6 -411.3391 834.6783 850.3093
## F81+I 7 -410.8956 835.7913 854.0275
## F81+G(4) 7 -411.088 836.1759 854.4121
## F81+G(4)+I 8 -410.9202 837.8403 858.6817
## K80 4 -411.009 830.0181 840.4387
## K80+I 5 -410.5436 831.0872 844.1131
## K80+G(4) 5 -411.0817 832.1634 845.1893
## K80+G(4)+I 6 -410.5685 833.1369 848.7679
## HKY 7 -409.6384 833.2767 851.5129
## HKY+I 8 -409.1713 834.3425 855.1839
## HKY+G(4) 8 -409.7322 835.4645 856.3058
## HKY+G(4)+I 9 -409.1797 836.3595 859.806
## TrNe 5 -410.3167 830.6335 843.6593
## TrNe+I 6 -409.7014 831.4029 847.0339
## TrNe+G(4) 6 -410.3156 832.6312 848.2622
## TrNe+G(4)+I 7 -409.7002 833.4004 851.6366
## TrN 8 -408.9192 833.8383 854.6797
## TrN+I 9 -408.3206 834.6411 858.0877
## TrN+G(4) 9 -408.9179 835.8357 859.2823
## TrN+G(4)+I 10 -408.3189 836.6378 862.6895
## TPM1 5 -409.5038 829.0077 842.0335
## TPM1+I 6 -409.0069 830.0139 845.6449
## TPM1+G(4) 6 -409.4921 830.9843 846.6153
## TPM1+G(4)+I 7 -408.997 831.9941 850.2303
## K81 5 -409.5038 829.0077 842.0335
## K81+I 6 -409.0069 830.0139 845.6449
## K81+G(4) 6 -409.4921 830.9843 846.6153
## K81+G(4)+I 7 -408.997 831.9941 850.2303
## TPM1u 8 -408.1641 832.3282 853.1695
## TPM1u+I 9 -407.6654 833.3309 856.7774
## TPM1u+G(4) 9 -408.1523 834.3045 857.7511
## TPM1u+G(4)+I 10 -407.6565 835.3129 861.3646
## TPM2 5 -410.9451 831.8902 844.9161
## TPM2+I 6 -410.4724 832.9449 848.5759
## TPM2+G(4) 6 -410.9584 833.9169 849.5479
## TPM2+G(4)+I 7 -410.5701 835.1402 853.3764
## TPM2u 8 -409.6384 835.2768 856.1182
## TPM2u+I 9 -409.1709 836.3417 859.7883
## TPM2u+G(4) 9 -409.7304 837.4608 860.9074
## TPM2u+G(4)+I 10 -409.2057 838.4114 864.4631
## TPM3 5 -410.6606 831.3211 844.347
## TPM3+I 6 -410.135 832.27 847.901
## TPM3+G(4) 6 -410.7457 833.4913 849.1223
## TPM3+G(4)+I 7 -410.1335 834.2669 852.5031
## TPM3u 8 -409.2563 834.5126 855.354
## TPM3u+I 9 -408.7334 835.4669 858.9134
## TPM3u+G(4) 9 -409.3134 836.6268 860.0733
## TPM3u+G(4)+I 10 -408.7317 837.4635 863.5152
## TIM1e 6 -408.6682 829.3363 844.9673
## TIM1e+I 7 -407.8555 829.711 847.9472
## TIM1e+G(4) 7 -408.6465 831.2931 849.5293
## TIM1e+G(4)+I 8 -407.8263 831.6526 852.494
## TIM1 9 -407.4022 832.8045 856.251
## TIM1+I 10 -406.7374 833.4748 859.5265
## TIM1+G(4) 10 -407.3871 834.7742 860.8259
## TIM1+G(4)+I 11 -406.7184 835.4367 864.0936
## TIM2e 6 -410.2468 832.4936 848.1246
## TIM2e+I 7 -409.6277 833.2553 851.4915
## TIM2e+G(4) 7 -410.2458 834.4915 852.7277
## TIM2e+G(4)+I 8 -409.6266 835.2531 856.0945
## TIM2 9 -408.9076 835.8152 859.2617
## TIM2+I 10 -408.3128 836.6257 862.6774
## TIM2+G(4) 10 -408.9062 837.8125 863.8642
## TIM2+G(4)+I 11 -408.3111 838.6222 867.2791
## TIM3e 6 -410.1373 832.2746 847.9056
## TIM3e+I 7 -409.5169 833.0337 851.2699
## TIM3e+G(4) 7 -410.1357 834.2713 852.5075
## TIM3e+G(4)+I 8 -409.515 835.0301 855.8714
## TIM3 9 -408.6821 835.3641 858.8107
## TIM3+I 10 -408.0775 836.1549 862.2066
## TIM3+G(4) 10 -408.6803 837.3606 863.4123
## TIM3+G(4)+I 11 -408.0753 838.1505 866.8074
## TVMe 7 -408.7245 831.4491 849.6853
## TVMe+I 8 -408.0109 832.0217 852.8631
## TVMe+G(4) 8 -408.6563 833.3127 854.154
## TVMe+G(4)+I 9 -408.0248 834.0497 857.4962
## TVM 10 -407.3962 834.7924 860.8442
## TVM+I 11 -406.7325 835.465 864.1218
## TVM+G(4) 11 -407.3325 836.665 865.3219
## TVM+G(4)+I 12 -406.7443 837.4885 868.7506
## SYM 8 -408.2488 832.4977 853.339
## SYM+I 9 -407.4052 832.8105 856.257
## SYM+G(4) 9 -408.2177 834.4355 857.882
## SYM+G(4)+I 10 -407.3973 834.7946 860.8463
## GTR 11 -406.8772 835.7544 864.4113
## GTR+I 12 -406.0849 836.1698 867.4318
## GTR+G(4) 12 -406.8465 837.693 868.9551
## GTR+G(4)+I 13 -406.0717 838.1434 872.0107
best_model<- as.pml(test)
best_model
## model: JC
## loglikelihood: -412.7676
## unconstrained loglikelihood: -376.4671
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.25 0.25 0.25 0.25
7. Is the HKY the best model?
No, if we look at the modelTest output the best model is Jukes-Cantor, this is because it has lower AIC and BIC values compared to the HKY model.
9. Do you think it is possible that Model Test estimate a model different from the HKY despite having simulated our sequences under this model? Why?
Yes, it can happen, it may be because the data would fit better with another model like JC.
10. Are the parameters of the model (probabilities and
frequencies) correctly estimated by Model Test (=are the values similar
to those of matrix Q.HKY)?
No, because the Model Test (Jukes-Cantor) assumes the same frequency for all bases (0.25), and in the Q.HKY matrix we can see that the base frequecies are not equal, indicating a mismatch in the model estimation.
Deliver this document in AULAESCI with your answers
Deadline: June 28, 2024 - 23:59
Doubts? alejandro.sanchez@prof.esci.upf.edu